DOCUMENT RESUME

ED 190 604                                              TM 800 396

AUTHOR          Roid, Gale; And Others
TITLE           A Comparison of Item-Writing Methods for
                Criterion-Referenced Tests.
INSTITUTION     Oregon State System of Higher Education, Monmouth.
                Teaching Research Div.
SPONS AGENCY    Advanced Research Projects Agency (DOD), Washington,
                D.C.
PUB DATE        Apr 80
CONTRACT        MDA-903-77-C-0189
NOTE            [page count illegible]; Paper presented at the joint
                Annual Meetings of the American Educational Research
                Association and the National Council on Measurement in
                Education (Boston, MA, April 7-11, 1980).
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     *Criterion Referenced Tests; *Difficulty Level;
                Elementary Education; Elementary School Teachers;
                *Item Analysis; Pretests Posttests; Prose; Reading
                Comprehension; *Reading Tests; Test Construction;
                *Test Format; *Test Items
IDENTIFIERS     *Distractors (Tests)



ABSTRACT 

Using informal, objectives-based, or linguistic
methods, three elementary school teachers and three experienced item
writers developed criterion-referenced pretests-posttests to
accompany a prose passage. Item difficulties were tabulated on the
responses of 364 elementary students. The informal-subjective method,
used by many achievement test developers, allowed maximum freedom of
wording and yielded significantly more difficult items. The
objectives-based and linguistic methods, in which the item writer
chose the foils (distractors), were susceptible to item writer bias.
In contrast, the method which provided maximum control, the
linguistic-based algorithmic foil method, yielded items which
were easy and not subject to bias. It also created items which were
insensitive to the pretest-posttest shift in difficulty. Therefore,
algorithmic foil methods are promising because they control item
writer differences; more research is needed before reasonable item
difficulties and instructional sensitivity can be obtained.
Teacher-produced items were more sensitive to instruction than those
of experienced writers. It is concluded that item-writer differences
are real and that field testing is important to identify these
differences. (CP)






A COMPARISON OF ITEM-WRITING METHODS FOR CRITERION-REFERENCED TESTS
Gale Roid, Tom Haladyna and Joan Shaughnessy
Teaching Research Division 
Oregon State System of Higher Education 




Send correspondence to:
Tom Haladyna 

Teaching Research Division 

Oregon State System of Higher Education 

Monmouth, Oregon 97361 



Running Head: A Comparison of Item-Writing Methods 




ABSTRACT 

Statistical qualities of items were compared across six item writers 
who used informal, objective and linguistic methods of item writing. 
Items were written for pretests and posttests to accompany a prose passage. 
Item difficulties for each type of item were tabulated on the responses of 
364 elementary students. Foil-writing methods had significant effects on 
the pattern of item difficulties of the resulting items. Informal methods 
of item writing resulted in large differences between item writers. The 
study indicates the importance of field testing to identify possible item- 
writer differences in criterion-referenced tests. 






A COMPARISON OF ITEM-WRITING METHODS FOR CRITERION-REFERENCED TESTS 
A number of measurement theorists (Bormuth, 1970; Hambleton, Swaminathan,
Algina & Coulson, 1978; Hively, 1974; Millman, 1974; Popham, 1975) have
convincingly argued that criterion-referenced tests should be based on a
scientific item-writing technology. This technology begins with domain
specification, which is a delineation of the content, subject matter, or
job tasks to be tested. For a given domain, a universe or large pool of
test items is defined. This universe of test items may be created by an
item form (Hively, 1974; Osburn, 1968), by computerized algorithms (Millman
& Outlaw, 1977), or by other methods involving rules or operational
definitions of item-writing methods (Bormuth, 1970). A criterion-referenced
test is then created by taking a random sample of items from the universe to
produce a test form. The score obtained on this random sample of items is
a best estimate of a student's performance on the entire domain. Therefore,
a criterion-referenced test is defined as assessing "an individual's status
[referred to as a domain score] with respect to a well defined behavior
domain" (Hambleton et al., 1978, p. 2).
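
The sampling logic described above can be illustrated with a minimal sketch.
It assumes a hypothetical universe of 200 items and a hypothetical student
who would answer 150 of them correctly; the function name and all values are
illustrative and do not come from the study.

```python
import random


def estimate_domain_score(item_pool, answers_correct, n_items, seed=0):
    """Draw a simple random sample of items and return the proportion correct."""
    rng = random.Random(seed)
    sampled = rng.sample(item_pool, n_items)
    n_correct = sum(1 for item in sampled if answers_correct[item])
    return n_correct / n_items


# Hypothetical universe of 200 items; this student would answer 150 correctly.
pool = list(range(200))
knows = {item: item < 150 for item in pool}

true_domain_score = sum(knows.values()) / len(pool)          # 0.75
estimate = estimate_domain_score(pool, knows, n_items=40)    # sample-based estimate
```

The score on the 40-item form is only an estimate of the domain score; it
coincides with the true value when the whole universe is administered.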

A number of writers, including Anderson (1972), Bormuth (1970) and
Millman (1974), have argued that the method of writing tests from written
statements of learning objectives does not always include precise domain
specification and, therefore, is susceptible to item-writer differences.
Roid and Haladyna (1978) demonstrated that two experienced item writers
showed significant differences in the difficulty of the items they produced,
even though they wrote items using the same learning objectives or loosely
defined rules. Subsequent studies (Roid & Finn, 1978; Roid, Haladyna &
Shaughnessy, 1978; Roid, Haladyna, Shaughnessy & Finn, 1978) have further
documented the existence of item-writer differences, as well as the
potential for controlling these differences through algorithmic item-writing
techniques.

One emerging technology of criterion-referenced test development has
been in the area of item writing for prose learning (Bormuth, 1970).
Bormuth described a technology for transforming sentences or elements from
prose passages into test questions to assess reading comprehension. This
linguistic-based theory of item development has subsequently undergone
further research (Finn, 1975; Roid & Finn, 1978; Roid, Haladyna, Shaughnessy
& Finn, 1978). There are many learning objectives or instructional systems
that require the learning of information through the reading of prose
passages or other human verbal communications. The basic elements in these
instructional stimuli are sentences, which are countable. This collection of
countable units can define a domain of content for a course of instruction.
Clearly the most precise way to have tests that match the learning
objectives and teaching materials is to transform elements of the teaching
materials into questions. As Bormuth has described, by transforming sentences
from materials, it is possible for the difficulty of the test to be matched
exactly to the readability level of the materials. Therefore, a pure form
of criterion-referenced test can be developed for reading comprehension.

The purpose of the present study was to contrast informal, objective- 
based and linguistic-based methods of writing test items for criterion- 
referenced tests. The specific research questions of the study were: 

1. What differences exist in the characteristics of test items created
by various item-writing strategies, which ranged from informal-subjective
methods to algorithmic methods?




2. To what extent do item-writer differences exist as a function of
these various item-writing strategies?

3. Is there a difference in the quality of items produced by 
experienced item writers as contrasted with teachers for whose students 
the instructional materials have been selected and assigned? 

APPROACH 

Subjects 

Participants in this study were 364 elementary school students and
their teachers from four Oregon public schools. The students ranged
considerably in reading ability. Cooperating teachers were asked to select
students whose reading ability was sufficient to read the text. In some
instances, teachers read the prose passage to some of their students.
Students in this study were grouped into a variety of instructional settings
(e.g., ungraded or graded). There were no factors present that suggested
that this sample of students was significantly different from all students
of this age and range of reading levels.

Instruments

Test items were developed from a high-interest prose passage in a
popular wildlife magazine describing the characteristics of sharks. The
reading difficulty level of the passage was approximately fifth grade.

To examine the effects of item-writer variability, six item writers
wrote items to assess reading comprehension of the prose passage. Three
item writers were researchers who were professionally experienced in item
development; the other three writers were elementary school teachers. Each
of the six item writers wrote three items for each of 12 separate test forms,
representing different item-writing techniques, which are listed in the left
column of Table 1.




In constructing the items for Form 1, the informal-subjective technique,
no specifications for writing items were given to the writers. For Forms 2
and 3, six instructional objectives were given to the writers and each writer
was asked to construct one item designed to measure achievement of each
objective. Both Forms 2 and 3 contained three items from each writer.

Copies of the prose passage were given to a sample of 17 elementary
school teachers who were asked to mark with a yellow pen the sentences they
believed were most important for their students to learn. Sentences that
were chosen by a majority of the 17 teachers were identified. The standard
frequency index (Carroll, Davies & Richman, 1971) of each noun and adjective
in the chosen sentences was obtained and all nouns or adjectives with a
numerical index of 60 or less were identified as high information words, as
done in the study by Roid and Finn (1978). Then, 18 rare nouns (high
information nouns appearing only once in the passage), 18 key nouns (high
information nouns appearing in the passage four or more times), and 18 high
information adjectives were identified. Sentences which included these
identified words were then used as a basis for item generation for Forms 4
through 12.
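
The word-selection rules just described (an index of 60 or less, one
occurrence for rare nouns, four or more for key nouns) can be sketched as
follows. The noun tokens and index values are hypothetical and are not taken
from the passage or from the Word Frequency Book; the function name is
likewise an assumption made for illustration.

```python
from collections import Counter

SFI_CUTOFF = 60  # words at or below this index count as high information


def classify_nouns(passage_nouns, sfi):
    """Split high-information nouns into rare (one use) and key (four or more uses)."""
    counts = Counter(passage_nouns)
    rare, key = [], []
    for word, n in counts.items():
        # Words with no listed index are treated as common (assumption).
        if sfi.get(word, 100) > SFI_CUTOFF:
            continue
        if n == 1:
            rare.append(word)
        elif n >= 4:
            key.append(word)
    return sorted(rare), sorted(key)


# Hypothetical noun tokens and index values, for illustration only.
nouns = ["shark"] * 5 + ["cartilage"] + ["water"] * 6 + ["gill"] * 2
sfi_values = {"shark": 48.0, "cartilage": 35.0, "water": 72.0, "gill": 40.0}

rare_nouns, key_nouns = classify_nouns(nouns, sfi_values)
```

In this toy data, "water" is excluded because its index is above the cutoff,
and "gill" is high information but falls into neither noun category because
it occurs twice.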

For each form, 4 through 12, each item writer was randomly assigned
three sentences of the 18 identified sentences which included the high
information words. All writers participated in a training session
reviewing the procedures described by Finn (1973) to transform prose sentences
into item stems for multiple-choice questions. Using Finn's procedure,
each item writer constructed three rare noun, three key noun and three
adjective stems.

Then writers prepared the sets of foils for each stem. In the first
foil-construction technique, writer's-choice 1 (WC1), writers selected three
single-word foils as possible replacements for the correct answer. In
writer's-choice 2 (WC2), the writers were permitted to substitute two or
more words as foils for the correct answer. The third foil technique used
an algorithm which supplied three individual foil words from the prose
passage in the same semantic category as the answer word, using the method
of Roid and Finn (1978).
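
A rough sketch of the algorithmic foil idea follows. It is not the Roid and
Finn (1978) algorithm, whose details are not given here: the stem is formed
by simply blanking the answer word (a stand-in for Finn's transformation
rules), the semantic categories are hand-assigned, and the sentence, word
list, and function names are all hypothetical.

```python
import random


def make_stem(sentence, answer):
    """Blank out the answer word to form a simple completion stem."""
    return sentence.replace(answer, "_____", 1)


def algorithmic_foils(answer, category_of, n_foils=3, seed=0):
    """Pick foil words from the passage that share the answer's semantic category."""
    category = category_of[answer]
    candidates = sorted(
        word for word, cat in category_of.items()
        if cat == category and word != answer
    )
    if len(candidates) < n_foils:
        raise ValueError("not enough same-category words in the passage")
    return random.Random(seed).sample(candidates, n_foils)


# Hypothetical passage words with hand-assigned semantic categories.
categories = {
    "cartilage": "body", "teeth": "body", "fins": "body",
    "gills": "body", "skin": "body", "ocean": "place", "reef": "place",
}

stem = make_stem("A shark's skeleton is made of cartilage.", "cartilage")
foils = algorithmic_foils("cartilage", categories)
```

Because the foils are drawn clerically, the item writer has no influence on
their wording, which is the source of both the control and the limitations
of this technique.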
Procedures 

Participating teachers volunteered to administer pretests to students
identified as able to read the prose passage, provide some instruction, and
administer a posttest following this instruction. In one classroom,
instruction consisted of simply reading the passage, while in other
classrooms, the material was incorporated as part of a larger unit of
instruction. Thus, instruction varied considerably for this group of
students. No attempt was made to isolate test results by classroom or method
of instruction.

Analysis of Data

Student responses to test questions were tabulated and item analyses
conducted. An analysis of variance using item difficulty (percentage of
students getting the item correct) as the dependent measure was performed.
The design for the study was a four-factor analysis of variance with the
factors being: (a) 12 item-writing techniques, (b) six item writers, (c) two
types of item writers, and (d) the repeated measures of pretest and posttest.
The item writers, as a factor, were nested in the third factor, types of item
writers.
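
As a bookkeeping aid, the between-item degrees of freedom implied by this
design can be tallied as below. This is a sketch under the assumption that
the three items each writer produced for each technique serve as the
replicates within a cell; the function name and dictionary keys are
illustrative, and the choice of error term is not stated in the text.

```python
def design_degrees_of_freedom(n_techniques=12, n_writers=6, n_types=2,
                              items_per_cell=3):
    """Between-item degrees of freedom with writers nested within writer type."""
    n_items = n_techniques * n_writers * items_per_cell
    df = {
        "technique": n_techniques - 1,
        "writer_type": n_types - 1,
        "writers_within_type": n_writers - n_types,
        "technique_x_type": (n_techniques - 1) * (n_types - 1),
        "technique_x_writers_within_type": (n_techniques - 1) * (n_writers - n_types),
        "items_within_cells": n_techniques * n_writers * (items_per_cell - 1),
    }
    return n_items, df


n_items, df = design_degrees_of_freedom()
```

Under these assumptions the design contains 216 items, and the component
degrees of freedom sum to 215, one less than the number of items.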

Interactions involving the repeated measure were examined closely because
they suggest that a particular technique or combination of techniques is more
or less sensitive to changes in learning that occur as a function of
instruction. That is, groups of items reflecting a particular item-writing
strategy, item writer, type of item writer (or any combination of these
three variables) may show unusually large or small pretest to posttest
changes, which suggest that a technique or combination of techniques is more
or less effective at measuring this change in achievement.

RESULTS AND DISCUSSION 
Analysis of Variance Results 

The results of the analysis of variance revealed five statistically
significant results:

1. Item-writing technique, F = 3.59; df = 11, 144; p [illegible].

2. The interaction of item-writing technique with the nested factor,
item writers, F = 1.79; df [illegible]; p = .006.

3. The interaction of item-writing technique with item-writer type,
F = 2.31; df = 11, 144; p = .012.

4. The repeated measure, F = 208.79; df = 1, 144; p < .001.

5. The interaction of item-writer type with the repeated measure,
F = 4.02; df = 1, 144; p = .027.

All other main effects and interactions were not statistically
significant (p > .05).

Item-writing techniques. These item-writing techniques ranged from the
informal, subjective technique that most achievement test developers have
used in the past to several examples of the linguistic-based methods proposed
by Bormuth (1970) and refined by Roid and Finn (1978).

Means and standard deviations for these 12 techniques across all
conditions of the study appear in Table 1.

Insert Table 1 about here 




As recommended by Winer (1962, pp. 77-85), a studentized range statistic
was employed to contrast the 12 levels of this independent variable. All
pairs of means were contrasted and only four item-writing techniques were
found to be significantly different (p < .05). The four techniques and
their mean difficulties were:

Subjective-informal                              41%
Adjective stem with writer's-choice 2 foils      44%
Key noun stem with algorithmic foils             57%
Rare noun stem with algorithmic foils            54%

Thus, it seems that the subjective-informal technique leads to significantly
more difficult items, while the linguistic-based techniques vary considerably
in difficulty as a function of the stem and the foil techniques. What
emerges from this analysis is the fact that two of the three item-writing
strategies involving algorithm-generated foils were significantly easier than
what resulted when using the subjective-informal technique. All three
conditions involving the use of the algorithm-generated foils yielded
above-average difficulty values (that is, higher percentages correct). This
is probably due to the fact that these algorithms often introduce obviously
incorrect foils, so that the student who does not know the correct answer can
often deduce it by eliminating these obviously incorrect foils. These results
also indicate that item difficulty is often a function of the method by which
items are created, a finding supported in previous studies (Roid & Haladyna,
1978; Roid, Haladyna & Finn, 1978; Roid, Haladyna & Shaughnessy, 1978).

Item-writing technique x item writer. An examination of this result is
complicated by the fact that there are 72 cells in this interaction. Winer
(1962, p. 232) recommends tests of simple effects to locate the specific
sources of significant variation. Unfortunately, tests of simple effects
for each item-writing strategy across the six item writers revealed large
and significant F ratios (p < .001) for each of the 12 strategies.
Consequently, a test of practical significance was employed which is more
arbitrary but emphasizes the extent to which certain item-writing
strategies interact with item writers to produce excessively hard or easy
items. The criterion employed was one involving those entries in Table 1 for
the interaction of item-writing strategies and item writers where the means
differed by more than one standard deviation (16) from the grand mean of 50.
Those means lying more than one standard deviation from the grand mean are
underlined in Table 1.
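
The practical-significance criterion can be expressed as a simple rule: with
a grand mean of 50 and a standard deviation of 16, cell means above 66 or
below 34 are flagged. The sketch below uses hypothetical cell means rather
than values from Table 1, and the labels and function name are illustrative.

```python
GRAND_MEAN = 50.0  # grand mean item difficulty (percent correct), from the text
SD = 16.0          # standard deviation, from the text


def flag_cell(mean_difficulty, grand_mean=GRAND_MEAN, sd=SD):
    """Label a cell mean that lies more than one SD from the grand mean."""
    if mean_difficulty > grand_mean + sd:
        return "very easy"
    if mean_difficulty < grand_mean - sd:
        return "very difficult"
    return "moderate"


# Hypothetical cell means, not values taken from Table 1.
example_cells = {"cell A": 30.0, "cell B": 52.0, "cell C": 70.0}
flags = {name: flag_cell(mean) for name, mean in example_cells.items()}
```

Because difficulty is expressed as percentage correct, a high cell mean
corresponds to easy items and a low cell mean to difficult items.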

Several findings are suggested by these underlined means. Two out of
three of the experienced item writers produced very difficult items using
the informal-subjective method of item writing, in contrast to the teachers,
who produced items of moderate difficulty. Clearly, these results demonstrate
that an open-ended item-writing technique can have a dramatic effect on item
characteristics. Furthermore, the item writer who wrote the most difficult
items overall (experienced item writer #2, with a mean of 44 overall)
created very difficult items using three techniques involving item-writer
freedom: the subjective-informal, the objective-based, and writer's-choice
version #1, where single rare nouns were provided by the item writer. An
intriguing reversal occurred, however, in that the same item writer
produced very easy items with the adjective-stem, algorithmic foil method.
This result, coupled with the fact that three other underlined means in
Table 1 were for the algorithmic foil method (76, 76 and 72), suggests that
the algorithmic foil method affects item-writer bias, but does so
occasionally by overcorrecting and creating very easy items.




From another perspective, the algorithmic method resulted in extremely
easy items in four out of 18 cases (three techniques by six item writers),
but the use of algorithms never resulted in items that were extremely
difficult. In contrast, when experienced item writers used informal or
objective-based techniques, they created very difficult items in three of
nine cases, as represented by the nine entries in the upper left of Table 1.
Thus, these results reinforce the concept that the idiosyncrasies of the item
writer, if given a chance to shine through, will have an effect on item
difficulties.

Item-writing technique x type of item writer. The preceding
interaction dealt with the factor of item writers, which was nested within
the factor of type of item writer. The first three item writers were
experienced test-item writers, while the last three were teachers. As shown
in Table 1, a pattern exists in the interaction of item-writing strategy and
type of item writer. Again, a test of simple effects was performed to
ascertain under which item-writing strategies the two types of item writers
differed. Statistically significant simple effects were detected for three
item-writing strategies: (a) informal-subjective, F = 7.89, p = .006;
(b) rare noun/writer's-choice 1, F = 4.55, p = .034; and
(c) adjective/algorithm, F = 7.39, p = .007. For the informal-subjective
method, the experienced item writers wrote significantly more difficult
items. In the case of the rare noun/writer's-choice 1 technique, a single
difficult item by item writer 2 from the experienced writer group had a
difficulty value low enough to produce this statistically significant
interaction. The third simple effect was a surprising reversal of
the tendency of the experienced item writers to write more difficult items.
That is, experienced item writers produced significantly easier items when
using the adjective-stem technique with algorithmically generated foils.
This unique result reinforces the finding mentioned previously that the
algorithmic-foil technique can apparently overcompensate for item-writer
effects by creating items that are too easy. A second explanation of this
result is that the teachers found the adjective-stem items to be the most
challenging of the linguistic-based methods. This was apparently due to the
fact that odd changes in wording are sometimes required to transform a
sentence into a question when an adjective has been deleted rather than a
noun.

Type of item writer and the repeated measure. As reported earlier, the
main effect of the repeated measure was highly significant (F = 208.79;
df = 1, 144; p < .001). Instruction was effective to the extent that pretest
and posttest means differed, 42% on the pretest and 58% on the posttest. This
gain of 16 percentage points, while statistically significant, is not the
kind of gain typically observed in a variety of instructional settings
(Haladyna & Roid, 1978). However, interactions of any other independent
variable with the repeated measures are one form of evidence that
combinations of techniques are more or less effective in measuring
achievement that was the target of instruction. In other words, test item
groups that obscured this 16-point increase in learning are viewed as less
reflective of what occurred and, therefore, less sensitive.

The cell and marginal means and standard deviations for the one
statistically significant interaction involving the repeated measures are
presented in Table 2. As shown there, teachers produced items showing the
greatest sensitivity to instruction when compared to experienced item
writers. This points out quite dramatically that the type of item writer can
affect an important item characteristic, sensitivity to instruction. If this
shift can be construed as an indicator of the quality of test items (the
tendency to detect real change in performance as a function of learning),
then teachers wrote slightly better items than experienced item writers in
this experiment.




Insert Table 2 about here



Item Analyses

As a means of further examining factors which potentially underlie the
results of this study, item analyses were conducted using an index originally
introduced by Cox and Vargas (1966) and recommended by Haladyna and Roid
(1978) as useful and appropriate for criterion-referenced tests.

The index used was the pre-to-post difference index (PPDI), the
difference between the pretest and posttest difficulties of an item. Those
items with PPDI's less than the arbitrary criterion of .10 were identified as
potentially defective. PPDI's vary from -1.00 to +1.00. A positive PPDI
indicates that the item reflects those changes in performance attributable
to learning, while low or negative PPDI's may be attributed to either
(a) inadequate instruction, or (b) a defective item.
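
The index can be computed directly from item responses, as in the sketch
below. It assumes the difference is taken as posttest minus pretest, which
matches the statement that a positive value reflects learning; the response
data and function names are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of students answering the item correctly (responses are booleans)."""
    return sum(responses) / len(responses)


def ppdi(pre_responses, post_responses):
    """Pre-to-post difference index: posttest difficulty minus pretest difficulty."""
    return item_difficulty(post_responses) - item_difficulty(pre_responses)


def is_insensitive(index, criterion=0.10):
    """Flag items whose PPDI falls below the .10 criterion used in the study."""
    return index < criterion


# Hypothetical responses from ten students to a single item.
pre = [True] * 4 + [False] * 6    # 40% correct before instruction
post = [True] * 6 + [False] * 4   # 60% correct after instruction

index = ppdi(pre, post)           # approximately 0.20
flagged = is_insensitive(index)   # False: the item clears the criterion
```

An item answered correctly less often after instruction than before yields a
negative index and is flagged along with items that show little change.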

First, items with PPDI's less than .10 were identified. Then these
items were subjected to inspection to uncover aspects of these items that
may have contributed to their low PPDI's as a function of type of item-writing
technique, item writer or type of item writer. No items were found to be
miskeyed or to contain obvious item-writing defects. Thus, it seems
reasonable to conclude that lack of sensitivity for any item was due to either
insufficient instruction or a fault in the method used to produce these items.
Consequently, the discussion that follows is based on an attempt to uncover
systematic trends that can be attributed to peculiarities of item-writing
techniques, recognizing that overall the instruction in this experiment was
of moderate effectiveness.

PPDI's spanned a very wide range, from -.56 to +.62. Of the 216 items
written by six item writers, 63 were found to have PPDI's below .10. An
analysis of these low sensitivity items by item-writing technique revealed
that most techniques yielded from three to five insensitive items, including
the informal-subjective, objective-based, and linguistic approaches.

Looking at the proportions of insensitive items by item writer, the
range was small, from 22% to 39%, with the number of insensitive items
ranging from 8 to 14. The factor of item writers was nested in the factor
of types of item writers; experienced item writers produced 31% insensitive
items, compared to teachers, who produced 26%. This finding is in
agreement with the earlier report of the interaction of the type of item
writer and the repeated measure.
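
The percentages above follow from the item counts: each writer produced
three items for each of 12 forms, or 36 items, and the six writers together
produced 216. A brief arithmetic check, with illustrative names and whole
number rounding assumed:

```python
ITEMS_PER_WRITER = 3 * 12              # three items on each of 12 forms
TOTAL_ITEMS = ITEMS_PER_WRITER * 6     # six item writers


def percent(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


lowest = percent(8, ITEMS_PER_WRITER)     # fewest insensitive items for one writer
highest = percent(14, ITEMS_PER_WRITER)   # most insensitive items for one writer
overall = percent(63, TOTAL_ITEMS)        # all insensitive items in the study
```

The overall figure of 29% lies between the 31% and 26% reported for the two
types of item writers, as it should when both groups wrote the same number
of items.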

With regard to foil-construction technique, the largest proportions of
insensitive items were found with the algorithmic-foil method. This clerical
and automated method of assembling foils by selecting words of similar
semantic categories from the passage produced the greatest proportions of
low sensitivity items found in the study: [illegible], 39%, and 50%. This
clearly demonstrates that automated foil-generation techniques used with
natural language questions are challenging, and will require greater
refinement than the technique implemented in the present study.

CONCLUSIONS 

The conclusions of the study can be summarized in relation to the three
main research questions of the study dealing with differences between
item-writing techniques, item-writer bias, and types of item writers.

Item-writing techniques. The technique which provided the maximum
amount of freedom of wording to the item writer was the informal-subjective
method, in which no objectives or rules of item writing were used. This
method resulted in significantly more difficult items, and was susceptible
to large differences between item writers. Also, objective-based and
linguistic methods in which the item writer chose the foils were found to
be susceptible in some instances to item-writer bias. In contrast, a
reversal was found with the technique which provides the maximum amount of
control of item-writer differences, the linguistic-based, algorithmic-foil
method. In this method, sentences from the prose passage are transformed
into questions by the item writer, and then foils are added clerically by
taking them from word lists created from the passage. This method resulted
in the easiest items, and created the highest proportion of items that were
insensitive to the pretest-posttest shift in difficulty. These findings
suggest that some degree of item-writer choice in wording may be necessary
in the context of writing items for reading comprehension, until foil-writing
algorithms can be developed that are more sensitive than those used in the
present study.

Perhaps the technique that survived the comparisons of the study with
the fewest limitations was the linguistic-based method involving the use of
nouns and the second writer's-choice method of foil construction. In this
technique, item writers transformed sentences into questions by deleting a
noun phrase and then forming foils by inserting a noun phrase in place of
the one deleted. This method resulted in good instructional sensitivity
(average of .20 difference between pretest and posttest item difficulties)
and a relatively homogeneous level of item difficulties across item
writers.

Item-writer differences. The finding that algorithmic-foil
construction, as implemented in this study, leads to items that are very easy
is similar to that found in earlier studies by Roid and Finn (1978) and Roid,
Haladyna, Shaughnessy and Finn (1978). The one advantage of algorithmic
methods of foil construction, which could not be examined in the present
study, but which was established in Roid, Haladyna, Shaughnessy and Finn
(1978), is that they control the variability of item difficulties between
item writers. Therefore, algorithmic foil methods may be promising in that
they control item-writer differences, but need to be designed so that
reasonable item difficulties and instructional sensitivity can be obtained.
Further research and development in foil-construction techniques is clearly
needed.

Types of item writers. It would be tempting but, of course, imprudent
to conclude from this study that the better item writers were teachers,
who may know their students more personally than a group of experienced item
writers. The more important meaning of the finding that item writers of
different backgrounds differed significantly in the resulting instructional
sensitivity of their items is that item-writer effects are real and
significant. The challenge is to develop tests with careful specification of
the item-writing methods to be used, so that these biases can be identified
and isolated. The only way these effects can be identified is through field
testing of items on students. Thus, the present study can be seen as
providing evidence for the importance of empirical item review. This does not
mean that items are to be selected or discarded from a domain on the basis of
their statistical qualities, but rather that the item specifications should
be changed if faulty items result. Documented item-writing methods can lead
to criterion-referenced tests that have the desirable property of being
random samples of items from a well-specified domain (Hambleton et al., 1978,
p. 38). Field testing can then provide quality control through the assessment
of the statistical qualities of items written by different people and
different methods.




REFERENCES 

ANDERSON, R. C. How to construct achievement tests to assess comprehension.
Review of Educational Research, 1972, 42, 145-170.

BORMUTH, J. R. On the theory of achievement test items. Chicago:
University of Chicago Press, 1970.

CARROLL, J. B., DAVIES, P., & RICHMAN, B. Word frequency book. Boston:
Houghton-Mifflin, 1971.

COX, R. C., & VARGAS, J. A comparison of item selection techniques for
norm-referenced and criterion-referenced tests. Paper presented at the
annual meeting of the American Educational Research Association, San
Francisco, April 1966.

FINN, P. J. A question writing algorithm. Journal of Reading Behavior,
1975, 7, 341-367.

FINN, P. J. Generating domain-referenced, multiple-choice test items from
prose passages. Paper presented at the annual meeting of the American
Educational Research Association, Toronto, March 1978.

HALADYNA, T., & ROID, G. The role of instructional sensitivity in the
empirical review of criterion-referenced tests. Paper presented at the
annual meeting of the American Educational Research Association,
San Francisco, April 1976.

HAMBLETON, R. K., SWAMINATHAN, H., ALGINA, J., & COULSON, D. B.
Criterion-referenced testing and measurement: A review of technical issues
and developments. Review of Educational Research, 1978, 48, 1-47.

HIVELY, W. Introduction to domain-referenced testing. Educational Technology,
1974, 14, 5-10.

MILLMAN, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley, California:
McCutchan Publishing Company, 1974.


MILLMAN, J., & OUTLAW, W. S. Testing by computer. Ithaca, New York:
Cornell University Extension Publications, 1977.

OSBURN, H. G. Item sampling for achievement testing. Educational and
Psychological Measurement, 1968, 28, 95-104.

POPHAM, W. J. Educational evaluation. Englewood Cliffs, N.J.: Prentice-Hall,
1975.

ROID, G. H., & FINN, P. J. Algorithms for developing test questions from
sentences in instructional materials (NPRDC Tech. Rep. 78-23). San Diego:
Navy Personnel Research and Development Center, 1978.

ROID, G. H., & HALADYNA, T. M. A comparison of objective-based and modified-
Bormuth item writing techniques. Educational and Psychological Measurement,
1978, 38, 19-28.

ROID, G., HALADYNA, T., & FINN, P. A comparison of several multiple-choice,
linguistic-based item writing algorithms. Paper presented at the annual
meeting of the American Educational Research Association, New York,
April 1977.

ROID, G., HALADYNA, T., & SHAUGHNESSY, J. A comparative study of informal
objective-based and linguistic-based item-writing methods. Monmouth,
Oregon: Teaching Research, 1978.

ROID, G., HALADYNA, T., SHAUGHNESSY, J., & FINN, P. Item writing for domain-
based tests of prose learning. Paper presented at the annual meeting of
the American Educational Research Association, San Francisco, April 1979.

WINER, B. J. Statistical principles in experimental design. New York:
McGraw-Hill, 1962.




FOOTNOTE 

This research was supported by contract number MDA-903-77-C-0189
from the Defense Advanced Research Projects Agency. Dr. Pat-Anthony
Federico of the Navy Personnel Research and Development Center, San Diego,
was the technical monitor for this project.

Views expressed in this article are those of the authors and not 
necessarily those of the supporting agencies. 

Acknowledgments are extended to Dr. John Bormuth of the University 
of Chicago, Dr. Patrick Finn of the State University of New York at 
Buffalo, and Dr. Jason Millman of Cornell University, who contributed to 
aspects of this research. Appreciation is also extended to Dr. Harold F. 
O'Neil, Jr., of the Army Research Institute, for his encouragement in the 
formative stages of this research. 






Table 1

Cell and Marginal Means and Standard Deviations of Item Difficulties(a) for the Main Effects of Item Writing
Techniques (A), Item Writers (B), Types of Item Writers (C), and Interactions Between A and B, and B and C

                            Experienced Item Writers                Teachers
                             1       2       3     Total     1       2       3     Total   TOTAL
Item Writing Technique      M SD    M SD    M SD    M SD    M SD    M SD    M SD    M SD    M SD

1. Informal-Subjective     26 13   34  2   37  5   33  9   57 14   52 18   42  7   51 13   41 14
2. Objective 1             38  4   45 12   64 20   48 17   58  9   47 21   42 15   49 15   49 16
3. Objective 2             50 20   30  7   58 10   -- 17   49  8   36 13   63  9   49 15   48 16
4. Rare Noun/WC1           58 10   26 23   45 18   -- 21   42  8   75  7   53  8   56 16   50 19
5. Rare Noun/WC2           55 --   48 22   43 10   49 14   45 11   71  8   56 10   58 14   53 --
6. Rare Noun/Algorithm     76 --   60 16   52 17   -- 16   59  9   76 --   59 11   65 12   64 14
7. Key Noun/WC1            58  8   40 11   41 10   46 12   49 15   61 24   50  7   53 16   50 14
8. Key Noun/WC2            58  5   39 18   39 11   45 14   --  9   --  9   56 --   50 10   48 12
9. Key Noun/Algorithm      64 15   46 15   54  3   56 13   62 12   45  5   72  8   60 14   57 13
10. Adjective/WC1          46 18   53 18   56  6   51 14   48 28   41 11   36 10   41 17   46 16
11. Adjective/WC2          42 15   36 28   57 10   45 19   45 16   39 11   42 10   42 11   44 15
12. Adject./Algorithm      55 20   69 23   59 11   61 17   50 14   44 10   38 17   44 13   52 17

TOTAL                      52 17   -- 19   50 13   49 17   50 13   53 18   51 14   52 15   50 16

(a) Entries are means and standard deviations of average item difficulties as calculated across
pretests and posttests.
-- Entry missing or illegible in the source copy.






Table 2

Means and Standard Deviations
for the Interaction of the Repeated Measure with Type of Item Writer

                              Pretest       Posttest        Total
                              M   SD        M   SD        M   SD

Experienced Item Writers     42   19       56   17       49   19
Teachers                     42   19       61   16       51   20

TOTAL                        42   19       58   17       50   16




AUTHORS 

ROID, GALE. Address: Horizon Associates, 763 Caroline Way N., Monmouth,
Oregon, 97361. Title: Consultant. Degrees: A.B. Harvard University,
M.A., Ph.D. University of Oregon. Specialization: Measurement;
Evaluation; Instructional improvement.

HALADYNA, TOM. Address: Teaching Research Division, Oregon State System
of Higher Education, Monmouth, Oregon, 97361. Title: Research
Professor. Degrees: B.A. Illinois State University, M.A. San Jose
State University, Ph.D. Arizona State University. Specialization:
Educational Measurement; Statistics; Research Methodology.

SHAUGHNESSY, JOAN. Address: Teaching Research Division, Oregon State
System of Higher Education, Monmouth, Oregon, 97361. Title: Instructor.
Degrees: B.A. Indiana University, M.A. Wichita State University,
Ph.D. candidate Michigan State University. Specialization: Child
Development.






